1. Spark SQL for ETL

	1.1 Using Spark with RDBMS
	
		1.1.1 Open a command-line terminal and run the following commands:
		
			mysql -u root -p 
			
			(password is hadoop)
			
			create database spark;
			
			use spark;
			
			create table online_retail(id integer not null auto_increment, invoiceNo varchar(20), stockCode varchar(20),
					description varchar(255), quantity integer, unitPrice double, customerID varchar(20), country varchar(100), invoiceDate Timestamp, 
					primary key(id));
					
		1.1.2 Open a new command-line terminal and run the following commands:

			cd /root
			
			wget https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx
		
			Convert "Online Retail.xlsx" to a CSV file named online_retail.csv using Microsoft Excel (Save As, file type "Text (Tab delimited)")
			
			Upload online_retail.csv to /root on the sandbox
			
			hadoop fs -put /root/online_retail.csv /tmp
			
			
		1.1.3 Run the following command:

			spark-submit --jars /usr/hdp/2.6.3.0-235/hive/lib/mysql-connector-java.jar --class ca.training.bigdata.spark.sql.etl.UsingSparkwithRDBMS --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
			
		
		1.1.4 Go back to the 1.1.1 terminal and run the following command:

			select * from spark.online_retail limit 10;
			
		1.1.5 Open a new command-line terminal and run the following command:

			hadoop fs -ls /tmp/online_retail
			
	1.2 Processing JSON Data

		1.2.1 Open a new command-line terminal and run the following commands:
		
			cd /root
			
			wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Musical_Instruments_5.json.gz
			
			gunzip reviews_Musical_Instruments_5.json.gz
			
			hadoop fs -put /root/reviews_Musical_Instruments_5.json /tmp
		
		1.2.2 Open a command-line terminal and run the following commands:

			mysql -u root -p
			
			create database spark;

		1.2.3 Run the following command:

			spark-submit --jars /usr/hdp/2.6.3.0-235/hive/lib/mysql-connector-java.jar --class ca.training.bigdata.spark.sql.etl.ProcessingJSONData --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
			
		1.2.4 Go back to the 1.2.2 terminal and run the following commands:

			select * from spark.musical_instruments_positive_reviews limit 10;
			select * from spark.musical_instruments_helpful_reviews limit 10;
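			The exact filter criteria inside ProcessingJSONData are not shown here, but the two result tables can be sketched in plain Python against the Amazon reviews JSON-lines format; the 4.0-star cutoff and 75% helpful-vote ratio below are illustrative assumptions, not the job's actual thresholds:

```python
import json

def classify_reviews(lines, positive_cutoff=4.0, helpful_ratio=0.75):
    """Split JSON-lines Amazon reviews into 'positive' and 'helpful' buckets."""
    positive, helpful = [], []
    for line in lines:
        r = json.loads(line)
        if r["overall"] >= positive_cutoff:
            positive.append(r)
        up, total = r.get("helpful", [0, 0])   # [helpful votes, total votes]
        if total > 0 and up / total >= helpful_ratio:
            helpful.append(r)
    return positive, helpful

sample = [
    '{"reviewerID":"A1","overall":5.0,"helpful":[4,4]}',
    '{"reviewerID":"A2","overall":2.0,"helpful":[0,3]}',
]
pos, hlp = classify_reviews(sample)
print(len(pos), len(hlp))  # 1 1
```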
		
		
	1.3 Processing CSV Data

		1.3.1 Open a command-line terminal and run the following commands:
		
			mysql -u root -p
			
			create database spark;
			
			use spark;
			
			create table online_retail(id integer not null auto_increment, invoiceNo varchar(20), stockCode varchar(20),
					description varchar(255), quantity integer, unitPrice double, customerID varchar(20), country varchar(100), invoiceDate Timestamp, 
					primary key(id));
					
		1.3.2 Open a new command-line terminal and run the following commands:

			cd /root
			
			wget https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx
		
			Convert "Online Retail.xlsx" to a CSV file named online_retail.csv (as in 1.1.2)
			
			hadoop fs -put /root/online_retail.csv /tmp
			
			
		1.3.3 Run the following command:

			spark-submit --jars /usr/hdp/2.6.3.0-235/hive/lib/mysql-connector-java.jar --class ca.training.bigdata.spark.sql.etl.ProcessingCSVData --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
			
		
		1.3.4 Go back to the 1.3.1 terminal and run the following command:

			select * from spark.online_retail limit 10;
			
	1.4 Processing XML Data

		1.4.1 Open a command-line terminal and run the following commands:
		
			cd /root
			
			wget https://archive.ics.uci.edu/ml/machine-learning-databases/00239/corpus.zip
			
			unzip -x corpus.zip
			
		1.4.2 Run the following command:

			spark-submit --jars /usr/hdp/2.6.3.0-235/hive/lib/mysql-connector-java.jar --class ca.training.bigdata.spark.sql.etl.ProcessingXMLData --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
			
	1.5 Processing Avro files

		1.5.1 Open a new command-line terminal and run the following commands:
		
			cd /root
			
			wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Musical_Instruments_5.json.gz
			
			gunzip reviews_Musical_Instruments_5.json.gz
			
			hadoop fs -put /root/reviews_Musical_Instruments_5.json /tmp
			
		1.5.2 Get Avro Tool

			cd /root
			
			wget http://central.maven.org/maven2/org/apache/avro/avro-tools/1.8.2/avro-tools-1.8.2.jar
			
		1.5.3 Run the following command:

			spark-submit --jars /usr/hdp/2.6.3.0-235/hive/lib/mysql-connector-java.jar --class ca.training.bigdata.spark.sql.etl.ProcessingAvroData --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
			
		1.5.4 Open a new command-line terminal and run the following commands:

			1.5.4.1 Check the result:
				
				ls -l /tmp/amazon_reviews/avro
				
				One Avro file (part-00000-xxxxx.avro) is generated there
				
			1.5.4.2 Print out the avro schema
				
				java -jar /root/avro-tools-1.8.2.jar getschema /tmp/amazon_reviews/avro/part-00000-xxxxx.avro

			1.5.4.3 Print out the metadata
				
				java -jar /root/avro-tools-1.8.2.jar getmeta /tmp/amazon_reviews/avro/part-00000-xxxxx.avro

			1.5.4.4 Dump the Avro data file as JSON
				
				java -jar /root/avro-tools-1.8.2.jar tojson /tmp/amazon_reviews/avro/part-00000-xxxxx.avro			

				
	1.6 Processing Parquet files				
	
		1.6.1 Open a new command-line terminal and run the following commands:
		
			cd /root
			
			wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Musical_Instruments_5.json.gz
			
			gunzip reviews_Musical_Instruments_5.json.gz
			
			hadoop fs -put /root/reviews_Musical_Instruments_5.json /tmp
			
		1.6.2 Get Parquet Tool

			cd /root
			
			wget http://central.maven.org/maven2/org/apache/parquet/parquet-tools/1.9.0/parquet-tools-1.9.0.jar
			
		1.6.3 Run the following command:

			spark-submit --jars /usr/hdp/2.6.3.0-235/hive/lib/mysql-connector-java.jar --class ca.training.bigdata.spark.sql.etl.ProcessingParquetData --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
				
		1.6.4 Open a new command-line terminal and run the following commands:

			1.6.4.1 Check the result:				
			
				hadoop fs -ls /tmp/amazon_reviews/parquet
				
				One Parquet file (part-00000-xxxxxx.snappy.parquet) is generated there
				
			1.6.4.2 Print out the Parquet metadata

				hadoop jar /root/parquet-tools-1.9.0.jar meta /tmp/amazon_reviews/parquet/part-00000-xxxxxx.snappy.parquet
				
				
				Note:
				
					RC: Row count
					TS: Total size in bytes
					SZ: Column chunk size (compressed/uncompressed/ratio)
					

	1.7 Processing Complex Data Types

		1.7.1 Functions
		
				get_json_object()
				from_json()
				to_json()
				selectExpr()
				explode()
				create pivot and unpivot tables
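				The semantics of two of these functions can be mimicked in plain Python to show what they compute (hypothetical stand-in helpers, not Spark code):

```python
import json

def get_json_object(doc, path):
    """Mimic Spark's get_json_object() for simple '$.a.b' paths."""
    node = json.loads(doc)
    for key in path.lstrip("$.").split("."):
        node = node[key]
    return node

def explode(rows, column):
    """Mimic explode(): emit one output row per element of an array column."""
    for row in rows:
        for item in row[column]:
            out = dict(row)
            out[column] = item
            yield out

doc = '{"device": {"id": 7, "temps": [20, 21]}}'
print(get_json_object(doc, "$.device.id"))   # 7
rows = [{"id": 7, "temps": [20, 21]}]
print(list(explode(rows, "temps")))
```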
				
		1.7.2 Start Kafka service from ambari console

		1.7.3 Open a new command-line terminal and create the topics "iot-devices" and "device_alerts":

			/usr/hdp/2.6.3.0-235/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic iot-devices
			
			/usr/hdp/2.6.3.0-235/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic device_alerts
			
		1.7.4 Launch a new command-line terminal and start a console consumer listening to iot-devices:

			/usr/hdp/2.6.3.0-235/kafka/bin/kafka-console-consumer.sh --bootstrap-server sandbox-hdp.hortonworks.com:6667 --from-beginning --topic iot-devices
			
		1.7.5 Launch a new command-line terminal and start a console consumer listening to device_alerts:
		
			/usr/hdp/2.6.3.0-235/kafka/bin/kafka-console-consumer.sh --bootstrap-server sandbox-hdp.hortonworks.com:6667 --from-beginning --topic device_alerts
			
		1.7.6 Run the following command:

			spark-submit --class ca.training.bigdata.spark.sql.etl.ProcessingComplexDataTypes --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
			
		1.7.7 Run the following command:	

			spark-submit --class ca.training.bigdata.spark.sql.etl.ProcessingIoTComplexDataTypes --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		

2. Spark SQL for Exploratory Data Analysis

	2.1 Using Spark SQL for basic data analysis

		2.1.1 Preparation of Data
		
			cd /root
			wget https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
			unzip -x bank-additional.zip
			cp bank-additional/bank-additional-full.csv bank-additional-full.csv
			
			hadoop fs -put bank-additional-full.csv /root
			
		2.1.2 Retrieving Data

			spark-submit --class ca.training.bigdata.spark.sql.eda.RetrievingData --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
		2.1.3 Identifying Missing Data
		
			spark-submit --class ca.training.bigdata.spark.sql.eda.IdentifyingMissingData --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
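			In the bank-additional dataset, fields are semicolon-separated and missing categorical values are encoded as the literal string "unknown". A stdlib sketch of a per-column missing-value count (the real job's logic may differ):

```python
import csv, io

def missing_counts(csv_text, missing_token="unknown", delimiter=";"):
    """Count missing values per column for a delimited text file."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value == missing_token or value == "":
                counts[name] += 1
    return counts

sample = "age;job\n44;blue-collar\n31;unknown\n"
print(missing_counts(sample))  # {'age': 0, 'job': 1}
```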
			
		2.1.4 Computing Basic Statistics

			spark-submit --class ca.training.bigdata.spark.sql.eda.ComputingBasicStatistics --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
		2.1.5 Identifying Data Outliers

			spark-submit --class ca.training.bigdata.spark.sql.eda.IdentifyingDataOutliers --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
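			One common way to flag outliers (the exact method used by IdentifyingDataOutliers is not shown here) is Tukey's IQR fences; a self-contained sketch:

```python
def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's fences)."""
    xs = sorted(values)

    def quantile(q):
        # linear interpolation between order statistics
        pos = q * (len(xs) - 1)
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

print(iqr_outliers([10, 12, 11, 13, 12, 11, 300]))  # [300]
```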
		
	2.2 Visualizing data with Apache Zeppelin

		2.2.1 Start Zeppelin service from Ambari console
		
		2.2.2 Import /root/TrainingOnHDP/SparkSQLforETL/zeppelin/Visualizing data with Apache Zeppelin.json
		
		2.2.3 Open the "Visualizing data with Apache Zeppelin" note and click the "Run all paragraphs" button
		
	2.3 Sampling data with Spark SQL APIs

		2.3.1 Sampling with Dataset API section
		
			spark-submit --class ca.training.bigdata.spark.sql.eda.SamplingwithDataset --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
		2.3.2 Sampling with RDD API section

			spark-submit --class ca.training.bigdata.spark.sql.eda.SamplingwithRDD --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
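		The two APIs behave differently: Dataset.sample keeps each row independently with the given fraction (so the result size varies), while RDD.takeSample returns an exact-size sample. A plain-Python sketch of both behaviors:

```python
import random

def sample_fraction(rows, fraction, seed=42):
    """Approximate Dataset.sample(): keep each row independently with prob `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

def take_sample(rows, n, seed=42):
    """Approximate RDD.takeSample(withReplacement=False, n): exact-size sample."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

data = list(range(1000))
frac = sample_fraction(data, 0.1)   # size is ~100, not exactly 100
fixed = take_sample(data, 10)       # size is exactly 10
print(len(fixed))  # 10
```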
			
		2.3.3 Using Spark SQL for creating pivot tables

			spark-submit --class ca.training.bigdata.spark.sql.eda.CreatingPivotTables --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
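			A pivot table groups on one column and spreads the distinct values of another column into output columns; the shape of the computation (illustrated with made-up retail rows) is:

```python
from collections import defaultdict

def pivot_sum(rows, index, columns, values):
    """Pure-Python analogue of groupBy(index).pivot(columns).sum(values)."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[index]][row[columns]] += row[values]
    return {k: dict(v) for k, v in table.items()}

rows = [
    {"country": "UK", "year": 2011, "amount": 10.0},
    {"country": "UK", "year": 2011, "amount": 5.0},
    {"country": "FR", "year": 2011, "amount": 7.0},
]
print(pivot_sum(rows, "country", "year", "amount"))
# {'UK': {2011: 15.0}, 'FR': {2011: 7.0}}
```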
		
3. Spark SQL for Data Munging

	3.1 Explore data munging techniques

		3.1.1 Pre-process the data
		
			cd /root
			wget https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip
			unzip -x household_power_consumption.zip
			
		3.1.2 Retrieving Data
			
			spark-submit --class ca.training.bigdata.spark.sql.dm.RetrievingData --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
			
		3.1.3 Compute basic statistics and aggregations

			spark-submit --class ca.training.bigdata.spark.sql.dm.ComputingBasicStatistics --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
			
		3.1.4 Enriching the Dataset

			spark-submit --class ca.training.bigdata.spark.sql.dm.EnrichmentDataset --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
		3.1.5 Analyze missing data
		
			spark-submit --class ca.training.bigdata.spark.sql.dm.AnalyzeMissingData --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
		3.1.6 Combine the Dataset using joins
		
			spark-submit --class ca.training.bigdata.spark.sql.dm.CombineDataset --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
				
				
	3.2 Munging on textual data

		3.2.1 Pre-process the data
		
			cd /root
			wget https://archive.ics.uci.edu/ml/machine-learning-databases/20newsgroups-mld/20_newsgroups.tar.gz
			gunzip 20_newsgroups.tar.gz
			tar -xvf 20_newsgroups.tar
			
			hadoop fs -mkdir /user/root/20_newsgroups
			hadoop fs -mkdir /user/root/20_newsgroups/comp.graphics
			hadoop fs -put /root/20_newsgroups/comp.graphics/* /user/root/20_newsgroups/comp.graphics
			
			wget http://algs4.cs.princeton.edu/35applications/stopwords.txt
			
		3.2.2 Processing multiple input data files

			spark-submit --class ca.training.bigdata.spark.core.munging.textual.ProcessingMultipleInputData --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
		3.2.3 Removing stop words
		
			spark-submit --class ca.training.bigdata.spark.core.munging.textual.RemovingStopWords --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
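		The stop-word removal step amounts to tokenizing each document and filtering against the downloaded stopwords.txt list; in plain Python:

```python
def remove_stop_words(text, stop_words):
    """Tokenize on whitespace, lowercase, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]

# a few entries in the spirit of stopwords.txt
stops = {"the", "a", "of", "and", "to"}
print(remove_stop_words("The history of computer graphics", stops))
# ['history', 'computer', 'graphics']
```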
		
	3.3 Munging on time-series data

		3.3.1 Pre-processing of the time-series dataset
		
			spark-submit --class ca.training.bigdata.spark.sql.munging.timeseries.PreProcessingDataSet --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
			
		3.3.2 Processing data fields	
		
			spark-submit --class ca.training.bigdata.spark.sql.munging.timeseries.ProcessingDataFields --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
		3.3.3 Persisting and loading data
		
			spark-submit --class ca.training.bigdata.spark.sql.munging.timeseries.PersistingLoadingData --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
		3.3.4 Create Time Series RDD
		
			spark-submit --class ca.training.bigdata.spark.sql.munging.timeseries.CreateTimeSeriesRDD --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
		3.3.5 Computing Basic Statistics
		
			spark-submit --class ca.training.bigdata.spark.sql.munging.timeseries.ComputingBasicStatistics --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
	3.4 Dealing with variable length records

		3.4.1 Pre-process the data
		
			cd /root
			wget https://archive.ics.uci.edu/ml/machine-learning-databases/00337/sNewsListWResults3yr.zip
			unzip -x sNewsListWResults3yr.zip
			unzip -x sNewsListWResults2012.zip
			unzip -x sNewsListWResults2013.zip
			unzip -x sNewsListWResults2014.zip
			mkdir sNewsListWResults
			cp sNewsListWResults2012/* sNewsListWResults
			cp sNewsListWResults2013/* sNewsListWResults
			cp sNewsListWResults2014/* sNewsListWResults
			
		3.4.2 Converting variable-length records to fixed-length records and De-Duplicate the records
		
			spark-submit --class ca.training.bigdata.spark.sql.munging.timeseries.ConvertingToFixLengthRecords --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
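			The conversion pads (or truncates) every record to a common arity so downstream code can rely on a fixed schema, and the de-duplication then drops exact repeats. A stdlib sketch with hypothetical helper names:

```python
def to_fixed_length(records, width, pad=""):
    """Pad or truncate each record to exactly `width` fields."""
    return [tuple((rec + [pad] * width)[:width]) for rec in records]

def deduplicate(records):
    """Drop exact duplicates, preserving first-seen order."""
    seen, out = set(), []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            out.append(rec)
    return out

recs = [["2012-01-03", "AAPL", "up"],
        ["2012-01-03", "AAPL", "up", "extra"],   # extra field gets truncated
        ["2012-01-04", "MSFT"]]                  # short record gets padded
fixed = to_fixed_length(recs, 3)
print(deduplicate(fixed))
```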
			
		3.4.3 Extracting data from "messy" columns

			spark-submit --class ca.training.bigdata.spark.sql.munging.timeseries.ExtractingFromMessyColumns --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
4. Data preparation for machine learning pipelines

	4.1 Pre-processing data for machine learning
		
		spark-submit --class ca.training.bigdata.spark.sql.ml.PreProcessingData --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
			
	4.2 Creating and running a machine learning pipeline
		
		spark-submit --class ca.training.bigdata.spark.sql.ml.RunningMachineLearning --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
5. Fuzzy Matching Analysis

	5.1 Soundex Phonetic Algorithm
	
		Based on English pronunciation: it matches words that are pronounced the same but spelled differently
		
		spark-submit --class ca.training.bigdata.spark.sql.fuzzymatch.SoundexAlgorithm --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
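		Spark SQL's soundex() function implements American Soundex (first letter plus three digits from consonant classes); a minimal pure-Python version shows how the code is derived:

```python
def soundex(word):
    """American Soundex: first letter + three digits from consonant classes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:     # skip repeats of the same class
            result += code
        if ch not in "hw":            # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]       # pad/truncate to 4 characters

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```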
		
	5.2 Other Phonetic Algorithms

		double_metaphone
		nysiis
		refined_soundex
		
		spark-submit --class ca.training.bigdata.spark.sql.fuzzymatch.PhoneticAlgorithm --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
	5.3 Levenshtein Similarity

		A string metric for measuring the difference between two sequences
		
		spark-submit --class ca.training.bigdata.spark.sql.fuzzymatch.LevenshteinDistance --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
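		Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other; the standard two-row dynamic program:

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert/delete/substitute."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution (or match)
                            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```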

	5.4 Other Similarity Functions
		
		cosine_distance
		fuzzy_score
		jaccard_similarity
		jaro_winkler
		
		spark-submit --class ca.training.bigdata.spark.sql.fuzzymatch.SimilarityFunctions --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
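		Jaccard similarity compares sets; for strings it is commonly computed over character n-gram sets. A sketch (bigrams are an illustrative choice, not necessarily what SimilarityFunctions uses):

```python
def jaccard_similarity(a, b, n=2):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over character n-gram sets."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

# "night" and "nacht" share only the bigram "ht": 1 of 7 distinct bigrams
print(round(jaccard_similarity("night", "nacht"), 3))
```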
		
	5.5 Sentence Similarity using Word2Vec

		5.5.1 Download GoogleNews-vectors-negative300.bin.gz
		
			wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
			
		5.5.2 Upload GoogleNews-vectors-negative300.bin.gz into /root on your sandbox

		5.5.3 Log in to localhost:4200 as root and run the following command:

			gunzip GoogleNews-vectors-negative300.bin.gz
			
		5.5.4 Convert GoogleNews-vectors-negative300.bin into GoogleNews-vectors-negative300.tsv

			./convertvec bin2txt GoogleNews-vectors-negative300.bin GoogleNews-vectors-negative300.tsv
			
		5.5.5 Run the following command in the command-line terminal:

			hadoop fs -put /root/sentence_pairs.txt /tmp
			hadoop fs -put /root/stopwords.txt /tmp
			hadoop fs -put /root/GoogleNews-vectors-negative300.tsv /tmp
			
		5.5.6 Run the following command:

			spark-submit --class ca.training.bigdata.spark.sql.fuzzymatch.SentenceSimilarity --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
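			Word2Vec-based sentence similarity typically averages the word vectors of each sentence (skipping stop words) and takes the cosine of the two averages. A toy sketch with made-up 3-dimensional vectors standing in for the 300-dimensional GoogleNews embeddings:

```python
import math

def sentence_vector(sentence, vectors, stop_words=frozenset()):
    """Average the word vectors of the non-stop-word tokens."""
    words = [w for w in sentence.lower().split()
             if w not in stop_words and w in vectors]
    dim = len(next(iter(vectors.values())))
    if not words:
        return [0.0] * dim
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# toy vectors: "cat" and "dog" are close, "car" is orthogonal to both
vecs = {"cat": [1.0, 0.0, 0.0], "dog": [0.9, 0.1, 0.0], "car": [0.0, 0.0, 1.0]}
sim = cosine(sentence_vector("the cat", vecs, {"the"}),
             sentence_vector("a dog", vecs, {"a"}))
print(round(sim, 3))
```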
			
			
6. Spark SQL Performance Tuning

	6.1 Optimizing data serialization

		wget http://goo.gl/lwgoxw
		
		unzip -x lwgoxw
		
		spark-submit --class ca.training.bigdata.spark.sql.tuning.OptimizingDataSerialization --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar

	6.2 Understanding the Dataset/DataFrame API

		Enter the Spark shell and run the following commands:
		
			val t1 = spark.range(7)
			t1.explain()
			t1.explain(extended=true)
			t1.filter("id != 0").filter("id != 2").explain(true)
			
	6.3 Understanding Catalyst transformation

		Enter the Spark shell and run the following commands:

			val t0 = spark.range(0, 10000000)
			val df1 = t0.withColumn("uniform", rand(seed=10))
			val df2 = t0.withColumn("normal", rand(seed=27))
			df1.createOrReplaceTempView("t1")
			df2.createOrReplaceTempView("t2")
			
			spark.sql("select sum(v) from (select t1.id, 1 + t1.uniform as v from t1 join t2 where t1.id = t2.id and t2.id > 5000000) tmp").explain(true)
			
			val t1 = spark.range(7)
			val t2 = spark.range(13)
			val t3 = spark.range(19)
			val t4 = spark.range(1e8.toLong)
			val t5 = spark.range(1e8.toLong)
			val t6 = spark.range(1e3.toLong)
			
			t1.join(t2).where(t1("id") === t2("id")).join(t3).where(t3("id") === t1("id")).explain()
			t1.join(t2).where(t1("id") === t2("id")).join(t3).where(t3("id") === t1("id")).count()
			t4.join(t5).where(t4("id") === t5("id")).join(t6).where(t4("id") === t6("id")).explain(true)
			
	6.4 Cost-based Optimization

		Enter the Spark shell and run the following commands:

			spark.sql("describe extended foodmart.customer").collect.foreach(println)
			spark.sql("analyze table foodmart.customer compute statistics")
			
	6.5 Build side selection

		Enter the Spark shell and run the following commands:
		
			spark.sql("drop table if exists t1")
			spark.sql("drop table if exists t2")
			spark.sql("create table if not exists t1(id long, value long) using parquet")
			spark.sql("create table if not exists t2(id long, value long) using parquet")
			spark.range(5E8.toLong).select('id, (rand(17) * 1E6) cast "long").write.mode("overwrite").insertInto("t1")
			spark.range(1E8.toLong).select('id, 'id cast "long").write.mode("overwrite").insertInto("t2")
			spark.sql("select t1.id from t1, t2 where t1.id = t2.id and t1.value = 100").explain()
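		Build-side selection matters because the hash table is built in memory on the smaller relation while the larger relation is streamed against it. The mechanics can be sketched in plain Python (a toy stand-in, not Spark's implementation):

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, key):
    """Hash join: build a table on the smaller side, stream the larger side."""
    table = defaultdict(list)
    for row in build_rows:                       # build phase (small relation)
        table[row[key]].append(row)
    out = []
    for row in probe_rows:                       # probe phase (large relation)
        for match in table.get(row[key], []):
            out.append({**match, **row})
    return out

small = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]   # build side
large = [{"id": 1, "value": 100}, {"id": 3, "value": 7}]   # probe side
print(hash_join(small, large, "id"))
# [{'id': 1, 'name': 'a', 'value': 100}]
```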
			
	6.6 JOIN ordering optimization

		spark-submit --class ca.training.bigdata.spark.sql.tuning.JoinOrderingOptimization --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
	
		spark-submit --conf spark.sql.cbo.enabled=true --class ca.training.bigdata.spark.sql.tuning.JoinOrderingOptimization --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar

		spark-submit --conf spark.sql.cbo.enabled=true --conf spark.sql.cbo.joinReorder=true --class ca.training.bigdata.spark.sql.tuning.JoinOrderingOptimization --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
	6.7 Whole-stage Code Generation

		spark-submit --conf spark.sql.codegen.wholeStage=false --class ca.training.bigdata.spark.sql.tuning.WholeStageCodeGeneration --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
	
		spark-submit --conf spark.sql.codegen.wholeStage=true --class ca.training.bigdata.spark.sql.tuning.WholeStageCodeGeneration --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
	
7. Extend Spark SQL using Custom Strategy

	7.1 Run the following command:

		spark-submit --class ca.training.bigdata.spark.sql.catalyst.ExtendSparkSQL --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
		
		A couple of "Custom Strategy" messages will be printed out
		
	7.2 Run the following command:

		spark-submit --class ca.training.bigdata.spark.sql.catalyst.RegularSimpleJoin --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
	
		It will take a couple of seconds to print out the result, since it is a huge join
		
	7.3 Run the following command:

		spark-submit --class ca.training.bigdata.spark.sql.catalyst.SmartSimpleJoin --driver-memory 2G --executor-memory 2G --master local[1] /root/TrainingOnHDP/SparkSQLforETL/target/SparkSQLforETL-1.0-SNAPSHOT-jar-with-dependencies.jar
	
	

			
		
		
		
		